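2024 iThome Ironman Contest

DAY 9

Generative AI

All About Gemini Multimodal Large Language Models, part 9 of the series

All About Gemini Multimodal Large Language Models Day 9 - Exploring the Audio Capabilities of the Gemini API

16th Ironman Contest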

kevin_chiu

2024-09-19 19:27:10


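Introduction

    All code in this post runs in Colab. If you want to run it in another environment, adjust the code accordingly.

Colab setup: configure the project and the API key
Load the Gemini library

# In Colab, install the SDK first by uncommenting the next line.
# !pip install -q -U google-generativeai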
import google.generativeai as genai

API key

from google.colab import userdata
API_KEY=userdata.get('GOOGLE_API_KEY')

#genai.configure(api_key="YOUR_API_KEY")

# Configure the client library by providing your API key.
genai.configure(api_key=API_KEY)

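Exploring the audio capabilities of the Gemini API

Gemini can respond to prompts that contain audio. For example, Gemini can:

Describe, summarize, or answer questions about audio content.
Provide a transcript of the audio.
Provide answers or a transcript for a specific segment of the audio.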

Supported audio formats

Gemini supports the following audio format MIME types:

WAV - audio/wav
MP3 - audio/mp3
AIFF - audio/aiff
AAC - audio/aac
OGG Vorbis - audio/ogg
FLAC - audio/flac
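When passing audio as inline data, the MIME type has to be supplied by hand, so a small lookup based on the list above can help avoid typos. This is only a sketch: the helper name `guess_audio_mime_type` and the choice of file extensions are assumptions for illustration and are not part of the Gemini SDK.

```python
from pathlib import Path

# MIME types taken from the list of supported audio formats above.
# The file extensions used as keys are an assumption for illustration.
AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".aiff": "audio/aiff",
    ".aac": "audio/aac",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
}


def guess_audio_mime_type(filename: str) -> str:
    """Return the MIME type for a supported audio file, or raise ValueError."""
    suffix = Path(filename).suffix.lower()
    try:
        return AUDIO_MIME_TYPES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported audio format: {filename!r}") from None
```

For example, `guess_audio_mime_type("sample.mp3")` gives the string to put in the `mime_type` field of an inline audio part.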

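Technical details about audio

Gemini applies the following rules to audio:

Gemini represents each second of audio as 25 tokens; for example, one minute of audio is represented as 1,500 tokens.
Gemini can only infer responses to English-language speech.
Gemini can "understand" non-speech sounds, such as birdsong or sirens.
The maximum supported length of audio data in a single prompt is 9.5 hours. Gemini does not limit the number of audio files in a single prompt; however, the combined length of all audio files in a single prompt cannot exceed 9.5 hours.
Gemini downsamples audio files to a 16 Kbps data resolution.
If the audio source contains multiple channels, Gemini combines those channels into a single channel.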
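The token rate and the length limit above lend themselves to a quick back-of-the-envelope check before sending a request. The sketch below simply encodes the two figures quoted in the list (25 tokens per second and 9.5 hours per prompt); the function names are hypothetical, and the count actually reported by the API may differ from this estimate.

```python
# Figures quoted in the list above; treat them as assumptions that may change.
TOKENS_PER_SECOND = 25
MAX_AUDIO_SECONDS = 9.5 * 3600  # 9.5 hours per prompt


def estimate_audio_tokens(duration_seconds: float) -> int:
    """Estimate the number of tokens for audio of the given length."""
    return round(duration_seconds * TOKENS_PER_SECOND)


def fits_in_one_prompt(durations_seconds) -> bool:
    """Check whether the combined audio length stays within the per-prompt limit."""
    return sum(durations_seconds) <= MAX_AUDIO_SECONDS
```

`estimate_audio_tokens(60)` reproduces the one-minute example from the list, and `fits_in_one_prompt` takes the durations of all audio files planned for a single prompt.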

Providing audio files to Gemini

Upload an audio file using the File API

!curl -o sample.mp3 https://storage.googleapis.com/generativeai-downloads/data/State_of_the_Union_Address_30_January_1961.mp3

# Upload the file.
audio_file = genai.upload_file(path='sample.mp3')

Send a prompt with the uploaded file to Gemini

# Initialize a Gemini model appropriate for your use case.
model = genai.GenerativeModel(model_name="gemini-1.5-flash")

# Create the prompt.
prompt = "Summarize the speech in Traditional Chinese."

# Pass the prompt and the audio file to Gemini.
response = model.generate_content([prompt, audio_file])

# Print the response.
print(response.text)

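Response (translated from the Traditional Chinese output)

The speech mainly discusses the domestic and international difficulties the United States faced at the time, including economic recession, rising unemployment, the federal budget deficit, and the Cold War situation. The speaker stresses that the United States remains strong and proposes several solutions, such as strengthening the military, improving the economy, enhancing international cooperation, aiding developing countries, and responding to Cold War threats. The speaker also calls on the whole nation to unite, face the challenges together, and work for the future of the United States.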
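The upload and prompt steps above can be wrapped into one function. The following is a minimal sketch, not the author's method: `summarize_uploaded_audio` is a hypothetical helper, and the cleanup step that calls `delete_file` is an addition that the original walkthrough does not perform. The `client` argument is expected to be the `genai` module (or any object exposing `upload_file` and `delete_file`), and `model` a `genai.GenerativeModel`.

```python
def summarize_uploaded_audio(client, model, path, prompt):
    """Upload an audio file, send it with a prompt, and return the reply text.

    The uploaded file is deleted afterwards (an assumption of this sketch;
    skip the cleanup if the file is reused in later prompts).
    """
    uploaded = client.upload_file(path=path)
    try:
        response = model.generate_content([prompt, uploaded])
        return response.text
    finally:
        # Remove the uploaded file even if the request fails.
        client.delete_file(uploaded.name)


# Example call in Colab, assuming genai is configured as shown earlier:
# model = genai.GenerativeModel(model_name="gemini-1.5-flash")
# print(summarize_uploaded_audio(genai, model, "sample.mp3", "Summarize the speech."))
```

Passing the client and model in as arguments keeps the function easy to exercise with stand-in objects, without a network connection or an API key.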

Provide the audio file as inline data in the request

# Download a small audio file from a remote server to the current
# working directory.
!curl -o samplesmall.mp3 https://storage.googleapis.com/generativeai-downloads/data/Apollo-11_Day-01-Highlights-10s.mp3

Next, send the small downloaded audio file to Gemini together with the prompt:

import pathlib

# Initialize a Gemini model appropriate for your use case.
model = genai.GenerativeModel('models/gemini-1.5-flash')

# Create the prompt.
prompt = "Summarize the audio in Traditional Chinese."

# Load the samplesmall.mp3 file into a Python Blob object containing the audio
# file's bytes and then pass the prompt and the audio to Gemini.
response = model.generate_content([
    prompt,
    {
        "mime_type": "audio/mp3",
        "data": pathlib.Path('samplesmall.mp3').read_bytes()
    }
])

# Output Gemini's response to the prompt and the inline audio.
print(response.text)
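The inline part passed to `generate_content` above is a plain dictionary holding a MIME type and the raw bytes. A small helper can build it and, optionally, refuse files that are too large to send inline. This is a sketch under assumptions: `make_inline_audio_part` is a hypothetical name, and `max_bytes` is a caller-chosen threshold, not a limit taken from this post.

```python
import pathlib


def make_inline_audio_part(path, mime_type="audio/mp3", max_bytes=None):
    """Read an audio file and wrap its bytes in the inline-data dict format.

    max_bytes is an optional, caller-chosen size threshold (hypothetical);
    when the file is larger, a ValueError is raised instead of building the part.
    """
    data = pathlib.Path(path).read_bytes()
    if max_bytes is not None and len(data) > max_bytes:
        raise ValueError(
            f"{path} is {len(data)} bytes, larger than max_bytes={max_bytes}"
        )
    return {"mime_type": mime_type, "data": data}


# Example call, assuming model and prompt are defined as above:
# response = model.generate_content([prompt, make_inline_audio_part("samplesmall.mp3")])
```

When the size check fails, uploading through the File API as shown earlier is the alternative.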

Get a transcript of the audio file

# Initialize a Gemini model appropriate for your use case.
model = genai.GenerativeModel(model_name="gemini-1.5-pro")

# Create the prompt.
prompt = "Generate a transcript in Traditional Chinese."

# Pass the prompt and the audio file to Gemini.
response = model.generate_content([prompt, audio_file])

# Print the transcript.
print(response.text)

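Response (translated from the Traditional Chinese output)

00:00:03 President Kennedy delivers the State of the Union address to a joint session of Congress from the rostrum of the House of Representatives, Washington, D.C., January 30, 1961.
00:00:21 Mr. Speaker, Mr. Vice President, Members of Congress.
00:00:26 It is a pleasure to return to the place I came from.
00:00:34 You are the preparation on which I build historic friendship,
......
33:42 I ask the Congress,
33:48 where it is deemed to be in the national interest,
33:54 to increase the use in this region of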
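The transcript above mixes `HH:MM:SS` and `MM:SS` timestamps, so post-processing it needs a parser that accepts both. The sketch below is one way to turn such text into `(seconds, text)` pairs; `parse_transcript` is a hypothetical helper, and it assumes each entry starts on its own line with a timestamp followed by whitespace. Lines without a leading timestamp, such as the elided `......` row, are skipped.

```python
import re

# Optional hours, then minutes and seconds, then the spoken text.
_TIMESTAMP_LINE = re.compile(r"^(?:(\d{1,2}):)?(\d{1,2}):(\d{2})\s+(.*)$")


def parse_transcript(text):
    """Return a list of (offset_in_seconds, text) pairs from a timestamped transcript."""
    entries = []
    for line in text.splitlines():
        match = _TIMESTAMP_LINE.match(line.strip())
        if match is None:
            continue  # Not a timestamped line; ignore it.
        hours, minutes, seconds, body = match.groups()
        offset = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
        entries.append((offset, body))
    return entries
```

With the offsets in seconds, entries can be sorted, filtered by time range, or aligned with an audio player regardless of which timestamp style the model chose.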

Referring to timestamps in the audio file

# Create a prompt containing timestamps.
prompt = "Provide a transcript of the speech from 02:30 to 03:29."
# Pass the prompt and the audio file to Gemini.
response = model.generate_content([prompt, audio_file])

# Print the transcript.
print(response.text)

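Response (translated from the Traditional Chinese output)

Sure, here is the transcript of the speech from 02:30 to 03:29:

"I speak today at a moment when national peril and national opportunity are intertwined.
....

In particular, this will enable us to meet any deliberate effort to avoid or divert our forces by starting limited wars in widely scattered parts of the globe.

Second, I have directed prompt action to step up our Polaris submarine program. Using unobligated shipbuilding funds now to let contracts originally scheduled for the next fiscal year will enable us to build and deploy, at least nine months earlier than planned, more of a crucial deterrent: a fleet that will never attack first, but that has sufficient power of retaliation, concealed beneath the seas, to discourage any aggressor from launching an attack on our security.

Third, I have directed prompt action to accelerate our entire missile program. Until the Secretary of Defense's reappraisal is completed, the emphasis here will be largely on improved organization and decision-making, cutting down the wasteful duplication and the time lag that have delayed our whole family of missiles. If we are to keep the peace, we need an invulnerable missile force powerful enough to deter any aggressor from even threatening an attack, because he would know that he could not destroy enough of our force to prevent his own destruction. For as I said upon taking the oath of office, only when our arms are sufficient beyond doubt can we be certain beyond doubt that they will never be used."

I hope this helps. If you have any other questions, please let me know.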
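When the time range comes from code rather than being typed by hand, the timestamp prompt can be assembled from offsets in seconds. This is a minimal sketch: `format_timestamp` and `build_range_prompt` are hypothetical helpers, and the `MM:SS` style simply mirrors the prompt used above.

```python
def format_timestamp(total_seconds: int) -> str:
    """Format an offset in seconds as MM:SS, the style used in the prompt above."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes:02d}:{seconds:02d}"


def build_range_prompt(start_seconds: int, end_seconds: int) -> str:
    """Build a prompt asking for the transcript of one time range."""
    if not 0 <= start_seconds < end_seconds:
        raise ValueError("start must be non-negative and earlier than end")
    return (
        "Provide a transcript of the speech from "
        f"{format_timestamp(start_seconds)} to {format_timestamp(end_seconds)}."
    )


# Example call, assuming model and audio_file are defined as above:
# response = model.generate_content([build_range_prompt(150, 209), audio_file])
```

Calling `build_range_prompt(150, 209)` yields the same 02:30 to 03:29 range requested in the example.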

Count tokens

model.count_tokens([audio_file])

Response

total_tokens: 83552
